Problem 1

Part A/B

Hypothesize a linear regression relationship:

I wanted to build on my previous hypothesis, that distance could be used to predict the fare of a route, by adding the average number of passengers who fly the route per day. I expect more popular routes to be cheaper than those with low traffic. Additionally, I felt that the best third explanatory variable to include in this analysis was the interaction between distance and passengers, which lets the effect of distance on fare change with passenger volume. The other options in the provided dataset just didn’t seem to mesh as well with the two that I have already included.

\[ \underbrace{Y_i}_\text{fare} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{base fare}}} + \overbrace{\beta_1}^{\stackrel{\text{slope}}{\text{baseline}}} \underbrace{X_{1i}}_\text{distance} + \overbrace{\beta_2}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{passen} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{dist:passen} + \epsilon_i \]

Multiple Regression results

Below is the multiple regression result using distance, passengers, and the distance-by-passengers interaction.

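library(magrittr) # provides the %>% pipe (also loaded by dplyr/tidyverse)
library(pander)   # formats the regression summary as a table

lm.mult <- lm(fare ~ dist + passen + dist:passen, data = IO_airfare)
summary(lm.mult) %>%
  pander(caption = "HW 3 Simple Multiple regression results")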

               Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)    118.7       2.046        58        0
dist           0.06534     0.001711     38.2      1.803e-277
passen         -0.0212     0.001724     -12.3     3.192e-34
dist:passen    1.537e-05   1.51e-06     10.18     4.511e-24

HW 3 Simple Multiple regression results

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
4595           57.62                 0.4083    0.4079

Below is the multiple regression result using distance and passengers, but without the distance-by-passengers interaction.

lm.mult2 <- lm(fare ~ dist + passen, data = IO_airfare)
summary(lm.mult2) %>%
  pander(caption = "HW 3 Simple Multiple regression w/o Interaction")

               Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)    108.8       1.823        59.69     0
dist           0.07541     0.001411     53.45     0
passen         -0.007297   0.001063     -6.864    7.574e-12

HW 3 Simple Multiple regression w/o Interaction

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
4595           58.26                 0.395     0.3947

HW 2 Simple Regression Results

Here are the results from HW 2 regression, prediction of fare using just distance.

lm.sim <- lm(fare ~ dist, data = IO_airfare)
summary(lm.sim) %>%
  pander(caption = "HW 2 simple regression results")

               Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)    103.3       1.643        62.87     0
dist           0.07631     0.001412     54.05     0

HW 2 simple regression results

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
4595           58.55                 0.3888    0.3886
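
The three fits above are nested (each adds one term to the previous one), so a sequential F-test is one way to compare them beyond looking at \(R^2\). The following is a minimal sketch using base R's anova(). The simulated data frame is a hypothetical stand-in that is used only if IO_airfare is not loaded; its numbers are invented for illustration and are not the homework data.

```r
# Hypothetical stand-in data, used ONLY when the real IO_airfare is not loaded.
# The coefficients below are made up for illustration.
if (!exists("IO_airfare")) {
  set.seed(1)
  IO_airfare <- data.frame(dist   = runif(200, 100, 2500),
                           passen = runif(200, 20, 8000))
  IO_airfare$fare <- 110 + 0.07 * IO_airfare$dist -
    0.01 * IO_airfare$passen + rnorm(200, sd = 50)
}

# Refit the three nested models from this report
lm.sim   <- lm(fare ~ dist, data = IO_airfare)
lm.mult2 <- lm(fare ~ dist + passen, data = IO_airfare)
lm.mult  <- lm(fare ~ dist + passen + dist:passen, data = IO_airfare)

# Sequential F-tests: row 2 tests adding passen, row 3 tests adding dist:passen
model_comparison <- anova(lm.sim, lm.mult2, lm.mult)
print(model_comparison)
```

Each row after the first tests whether the added term reduces the residual sum of squares by more than chance alone would explain.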

Completed Regression Equation

Here is the original equation with the estimated coefficients from the interaction model filled in. Because these are estimates, the left side is the predicted fare \(\hat{Y}_i\) and the error term drops out.

\[ \underbrace{\hat{Y}_i}_\text{fare} = \overbrace{118.7}^{\stackrel{\text{y-int}}{\text{base fare}}} + \overbrace{0.06534}^{\stackrel{\text{slope}}{\text{baseline}}} \underbrace{X_{1i}}_\text{distance} - \overbrace{0.0212}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{passen} + \overbrace{1.537 \times 10^{-5}}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{dist:passen} \]
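
As a quick check of how the fitted equation is used, here is a worked prediction for a hypothetical route of 1,000 miles with 500 passengers per day. These two values are chosen only for illustration and are not taken from the data.

\[ \begin{aligned} \hat{Y} &= 118.7 + 0.06534(1000) - 0.0212(500) + 0.00001537(1000)(500) \\ &= 118.7 + 65.34 - 10.60 + 7.685 \\ &\approx 181.13 \end{aligned} \]

Under those assumed inputs, the model would predict a fare of roughly $181.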

Plot

#b <- coef(lm.mult)
## Hint: library(car) has a scatterplot 3d function which is simple to use
#  but the code should only be run in your console, not knit.

#library(car)
#scatter3d(fare ~ dist + passen, data=IO_airfare)



## To embed the 3d-scatterplot inside of your html document is harder.


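#Load the packages used below (the regression itself is lm.mult, fit earlier)
library(plotly)   # plot_ly()
library(reshape2) # acast()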

#Graph resolution: number of grid points per axis. A fixed step of 0.5 can
#produce an enormous prediction grid when dist and passen span thousands of units.
graph_reso <- 100

#Setup Axis
axis_x <- seq(min(IO_airfare$dist), max(IO_airfare$dist), length.out = graph_reso)
axis_y <- seq(min(IO_airfare$passen), max(IO_airfare$passen), length.out = graph_reso)

#Sample points
lmnew <- expand.grid(dist = axis_x, passen = axis_y, KEEP.OUT.ATTRS=F)
lmnew$Z <- predict.lm(lm.mult, newdata = lmnew)
lmnew <- acast(lmnew, passen ~ dist, value.var = "Z") #y ~ x

#Create scatterplot
plot_ly(IO_airfare, 
        x = ~dist, 
        y = ~passen, 
        z = ~fare,
        text = rownames(IO_airfare), 
        type = "scatter3d", 
        mode = "markers", color = ~fare)
  #add_trace(z = lmnew,
   #         x = axis_x,
    #        y = axis_y,
     #       type = "surface")

Part C

Interpretation

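Based on the multiple regression, the base cost of a ticket (zero distance and zero passengers) would be $118.70. Because the model includes an interaction, the two main-effect slopes are baseline values: for each additional mile the fare increases by about $0.065 on a route with zero passengers, and for each additional daily passenger the fare decreases by about $0.021 at zero distance. The interaction coefficient between distance and passengers is 1.537e-05. It looks close to 0 only because it multiplies the product of miles and passengers; it means the per-mile increase in fare grows by about $0.0000154 for every additional passenger. The p-values for all of these terms are extremely close to 0, so each term is statistically significant. It is worth noting that the p-value for distance (1.803e-277) is far smaller than those for passengers (3.192e-34) or the interaction (4.511e-24), and its t value (38.2) is the largest, which suggests that distance is a much stronger predictor of fare than passenger count or the interaction.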
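
Because the model contains the dist:passen term, the effect of each variable depends on the value of the other. A sketch of the implied marginal effects, using the coefficients from the interaction model (the evaluation points of 1,000 passengers and 1,000 miles are hypothetical values picked for illustration):

\[ \begin{aligned} \frac{\partial \hat{Y}_i}{\partial X_{1i}} &= 0.06534 + 0.00001537\, X_{2i}, & \text{at } X_{2i} = 1000: \; 0.06534 + 0.01537 &= 0.08071 \\ \frac{\partial \hat{Y}_i}{\partial X_{2i}} &= -0.0212 + 0.00001537\, X_{1i}, & \text{at } X_{1i} = 1000: \; -0.0212 + 0.01537 &= -0.00583 \end{aligned} \]

The passenger effect reaches zero at \(X_{1i} = 0.0212 / 0.00001537 \approx 1379\) miles, so on longer routes the fitted model associates additional passengers with a slightly higher fare rather than a lower one.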

These relationships are best seen in the 3D plot. It is quickly apparent that distance is a significant predictor, because the points cluster around a clear upward trend along the distance axis.

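Assumptions

# Three diagnostic plots side by side
par(mfrow = c(1, 3))
plot(lm.mult, which = 1:2)  # residuals vs. fitted (linearity, constant variance) and normal Q-Q
plot(lm.mult$residuals)     # residuals in row order, to check independence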

Problem 2

Sources of OLS estimator Bias

Heteroskedasticity:
Omitting an important variable:
A sample correlation coefficient of 0.395 between two independent variables both included in the model:

Problem 3

How can it be that the \(R^2\) is smaller when the variable age is added to the equation? The above equations were estimated using the data in LAWSCH85 from your book.